Abstract: This paper considers the problem of matching a fragment to an organism using its complete genome. Our
method is based on the probability measure representation of 
a genome. We first demonstrate that these probability mea- 
sures can be modelled as recurrent iterated function systems 
(RIFS) consisting of four contractive similarities. Our hy- 
pothesis is that the multifractal characteristic of the probabil- 
ity measure of a complete genome, as captured by the RIFS, 
is preserved in its reasonably long fragments. We compute 
the RIFS of fragments of various lengths and random start- 
ing points, and compare with that of the original sequence 
for recognition using the Euclidean distance. A demonstra- 
tion on five randomly selected organisms supports the above 
hypothesis. 

I. INTRODUCTION

The DNA sequences of complete genomes provide es- 
sential information for understanding gene functions and 
evolution. A large number of these DNA sequences are currently
available in public databases (such as GenBank at
ftp://ncbi.nlm.nih.gov/genbank/genomes/ or KEGG at
http://www.genome.ad.jp/kegg/java/org_list.html). A
great challenge of DNA analysis is to determine the in- 
trinsic patterns contained in these sequences which are 
formed by four basic nucleotides, namely, adenine (a), 
cytosine (c), guanine (g) and thymine (t). 

Significant results have been obtained on long-range correlation in DNA sequences [1-16]. Li et al. [1] found that the spectral density of a DNA sequence containing mostly introns shows $1/f^{\beta}$ behaviour, which indicates the presence of long-range correlation when $0 < \beta < 1$. The correlation properties of coding and noncoding DNA sequences were first studied by Peng et al. in their fractal landscape or DNA walk model. In the DNA walk, the walker steps "up" if a pyrimidine (c or t) occurs at position i along the DNA chain, and steps "down" if a purine (a or g) occurs at position i. Peng et al. discovered that there exists long-range correlation in noncoding DNA sequences, while the coding sequences correspond to a regular random walk. By undertaking a more detailed analysis, Chatzidimitriou et al. concluded that both coding and noncoding sequences exhibit long-range correlation. A subsequent work by Prabhu and Claverie also substantially corroborates these results. If one considers more details by distinguishing c from t in pyrimidine, and a from g in purine (such as the two- or three-dimensional DNA walk models [5] and the maps given by Yu and Chen), then the presence of base correlation has been found even in coding sequences. On the other hand, Buldyrev et al. showed that long-range correlation appears mainly in noncoding DNA using all the DNA sequences available. Based on equal-symbol correlation, Voss showed a power law behaviour for the sequences studied regardless of the proportion of intron contents. These studies add to the controversy about the possible presence of correlation in the entire DNA or only in the noncoding DNA. From a different angle, fractal analysis has proven useful in revealing complex patterns in natural objects. Berthelsen et al. considered the global fractal dimensions of human DNA sequences treated as pseudorandom walks.
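
The DNA walk rule described above can be illustrated with a short sketch. This is only a minimal reading of the verbal definition (one step up for c or t, one step down for a or g); the function name `dna_walk` and the choice to return the cumulative positions are assumptions made for the example.

```python
def dna_walk(sequence):
    """Cumulative DNA walk: +1 for a pyrimidine (c, t), -1 for a purine (a, g)."""
    steps = {"c": 1, "t": 1, "a": -1, "g": -1}
    position = 0
    walk = []
    for base in sequence.lower():
        # A symbol other than a, c, g, t raises KeyError (assumed behaviour).
        position += steps[base]
        walk.append(position)
    return walk


if __name__ == "__main__":
    print(dna_walk("acgtt"))
```

Long-range correlation studies of the kind cited above examine how the fluctuations of such a walk scale with the window length.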

In the above studies, the authors only considered short or long DNA segments. Since the first complete genome of the free-living bacterium Mycoplasma genitalium was sequenced in 1995, an ever-growing number of complete genomes has been deposited in public databases. The availability of complete genomes makes it possible to establish some global properties of these sequences. Vieira carried out a low-frequency analysis of the complete DNA of 13 microbial genomes and showed that their fractal behaviour does not always prevail through the entire chain and that the autocorrelation functions have a rich variety of behaviours including the presence of anti-persistence. Yu and Wang proposed a time series model of coding sequences in complete genomes. For fuller details on the number, size and ordering of genes along the chromosome, one can refer to Part 5 of Lewin. One may ignore the composition of the four kinds of bases in coding and noncoding segments and only consider the global structure of the complete genomes or long DNA sequences. Provata and Almirantis proposed a fractal Cantor pattern of DNA. They mapped coding segments to filled regions and noncoding segments to empty regions of a random Cantor set and then calculated the fractal dimension of this set. They found that the coding/noncoding partition in DNA sequences of lower organisms is homogeneous-like, while in the higher eukaryotes the partition is fractal. This result does not seem refined enough to distinguish bacteria because the fractal dimensions computed for bacteria are all the same. The classification and evolutionary relationship of bacteria is one of the most important problems in DNA research. Yu and Anh [23] proposed a time series model based on the global structure of the complete genome and considered three kinds of length sequences. Calculation of the correlation dimensions and Hurst exponents showed that one can get more information from this model than from the fractal Cantor pattern. Some results on the classification and evolutionary relationship of bacteria were found [23]. The correlation property of these length sequences has been discussed [24]. The multifractal analysis for these length sequences was done in [25].

Although statistical analysis performed directly on DNA sequences has yielded some success, there has been some indication that this method is not powerful enough to amplify the difference between a DNA sequence and a random sequence, or to distinguish DNA sequences themselves in more detail [26]. One needs more powerful global and visual methods. For this purpose, Hao et al. [26] proposed a visualisation method based on counting and coarse-graining the frequency of appearance of substrings with a given length. They called it the portrait of an organism. They found that there exist some fractal patterns in the portraits which are induced by avoided and under-represented strings. The fractal dimension of the limit set of portraits was also discussed [27, 28]. There are other graphical methods for sequence patterns, such as the chaos game representation [29, 30].

Yu et al. introduced a representation of a DNA sequence by a probability measure of k-strings derived from the sequence. This probability measure is in fact the histogram of the events formed by all the k-strings in a dictionary ordering. It was found that these probability measures display a distinct multifractal behaviour characterised by their generalised Rényi dimensions (instead of a single fractal dimension as in the case of self-similar processes). Furthermore, the corresponding $C_q$ curves of these generalised dimensions of all bacteria resemble a classical phase transition at a critical point, while the "analogous" phase transitions of chromosomes of nonbacteria exhibit the shape of a double-peaked specific heat function. These patterns led to a meaningful grouping of archaebacteria, eubacteria and eukaryotes. Anh et al. [33] took a further step in providing a theory to characterise the multifractality of the probability measures of the complete genomes. In particular, the resulting parametric models fit extremely well the $D_q$ curves of the generalised dimensions and the corresponding $K_q$ curves of the above probability measures of the complete genomes.

A conclusion of the work reported in Yu et al. [31] and
Anh et al. [33] is that the histogram of the k-strings of
the complete genome provides a good representation of 
the genome and that these probability measures are mul- 
tifractal. This multifractality is, in most cases studied, 
characteristic of the DNA sequences, hence can be used 
for their classification. 

In this paper, we consider the problem of recognition 
of an organism based on fragments of its DNA sequences.
The identification of the organisms in a culture
commonly relies on their molecular identity markers
such as the genes that code for ribosomal RNA. How- 
ever, it is usual that most fragments lack the marker, 
"making the task of matching fragment to organism akin 
to reconstructing a document that has been shredded" 
(M. Leslie, "Tales of the sea", New Scientist, 27 January
2001). A well-known method to tackle the task is the 
random shotgun sequencing method, which scans the se- 
quences of all fragments looking for overlaps to be able 
to piece the fragments together. It is obvious that this 
technique is extremely time-consuming and many crucial 
fragments may be missing. 

This paper will provide a different method to approach this problem. Our starting point is the probability measure of the k-strings and its multifractality. We model this multifractality using a recurrent iterated function system (RIFS) consisting of four contractive similarities (to be described in Section IV). This branching number of four is a natural consequence of the four basic elements {a, c, g, t} of the DNA sequences. Each of these RIFS is specified by a matrix of incidence probabilities $P = (p_{ij})$, $i, j = 1, \ldots, 4$, with $p_{i1} + p_{i2} + p_{i3} + p_{i4} = 1$ for $i = 1, \ldots, 4$. It is our hypothesis that, for reasonably long fragments, the multifractal characteristic of the measure of a complete genome as captured by the matrix $P$ is preserved in the fragments. We thus represent each fragment by a vector

$$\left(\tfrac{1}{4}(p_{11} + p_{21} + p_{31} + p_{41}),\ \tfrac{1}{4}(p_{12} + p_{22} + p_{32} + p_{42}),\ \tfrac{1}{4}(p_{13} + p_{23} + p_{33} + p_{43})\right)$$

in $\mathbb{R}^3_+$. We will see that, for fragments of lengths at least 1/20 of the original sequence and with random starting points, these vectors are very close, in the Euclidean distance, to the vector of the complete sequence.

We will demonstrate the technique on five organisms, 
namely, A. fulgidus, B. burgdorferi, C. trachomatis, E. 
coli and M. genitalium. As remarked in Yu et al.,
substrings of length k = 6 are sufficient to represent
DNA sequences. For each organism, we compute the his- 
tograms for the 6-strings of its complete genome, and 4 
cases of fragments of lengths 1/4, 1/8, 1/15 and 1/20 
of the complete sequence. The starting position of each 
fragment is chosen randomly. The RIFS of the complete 
genome and each of the fragments are computed next. 
The numerical results are reported in Section V. Some 
conclusions will be drawn in Section VI.



II. MEASURE REPRESENTATION OF 
COMPLETE GENOMES 

We first outline the method of Yu et al. for deriving the measure representation of a DNA sequence. We call any string made up of k letters from the set {g, c, a, t} a k-string. For a given k there are in total $4^k$ different k-strings. In order to count the number of each kind of k-string in a given DNA sequence, $4^k$ counters are needed. We divide the interval [0, 1) into $4^k$ disjoint subintervals, and use each subinterval to represent a counter. Letting $s = s_1 \cdots s_k$, $s_i \in \{a, c, g, t\}$, $i = 1, \ldots, k$, be a substring with length k, we define

$$x_l(s) = \sum_{i=1}^{k} \frac{x_i}{4^i}, \qquad (1)$$

where

$$x_i = \begin{cases} 0, & \text{if } s_i = a, \\ 1, & \text{if } s_i = c, \\ 2, & \text{if } s_i = g, \\ 3, & \text{if } s_i = t, \end{cases} \qquad (2)$$

and

$$x_r(s) = x_l(s) + \frac{1}{4^k}. \qquad (3)$$

We then use the subinterval $[x_l(s), x_r(s))$ to represent
substring s. Let $N(s)$ be the number of times that substring s appears in the complete genome. If the number of bases in the complete genome is L, we define

$$F(s) = N(s)/(L - k + 1) \qquad (4)$$

to be the frequency of substring s. It follows that $\sum_{\{s\}} F(s) = 1$. We can now view $F(s)$ as a function of x and define a measure $\mu_k$ on [0, 1) by $d\mu_k(x) = Y_k(x)\,dx$, where

$$Y_k(x) = 4^k F_k(s), \quad x \in [x_l(s), x_r(s)). \qquad (5)$$

We then have $\mu_k([0, 1)) = 1$ and $\mu_k([x_l(s), x_r(s))) = F_k(s)$. We call $\mu_k(x)$ the measure representation of an organism. As an example, the measure representation of M. genitalium for k = 3, ..., 6 is given in FIG. 1. A fractal-like behaviour is apparent in the measures.
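
The construction of the frequencies $F(s)$ in Eq. (4) can be sketched as follows. This is a minimal illustration, not the authors' code: the digit assignment a, c, g, t to 0, 1, 2, 3 is an assumption consistent with the dictionary ordering mentioned in the text, and the names `DIGIT`, `left_endpoint` and `kstring_frequencies` are hypothetical.

```python
from collections import Counter

# Assumed dictionary ordering a < c < g < t, encoded as base-4 digits.
DIGIT = {"a": 0, "c": 1, "g": 2, "t": 3}


def left_endpoint(s):
    """x_l(s): read the k-string s as a base-4 fraction in [0, 1)."""
    return sum(DIGIT[ch] / 4 ** (i + 1) for i, ch in enumerate(s))


def kstring_frequencies(genome, k):
    """Return F as a list of length 4**k, indexed in dictionary order.

    Entry j is the frequency of the k-string whose subinterval is
    [j / 4**k, (j + 1) / 4**k); overlapping windows are counted, so the
    normalising constant is L - k + 1 as in Eq. (4).
    """
    genome = genome.lower()
    total = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(total))
    freq = [0.0] * 4 ** k
    for s, n in counts.items():
        index = 0
        for ch in s:
            # Symbols other than a, c, g, t raise KeyError (assumed behaviour).
            index = 4 * index + DIGIT[ch]
        freq[index] = n / total
    return freq


if __name__ == "__main__":
    F = kstring_frequencies("acgtacgt", 2)
    print(sum(F))  # frequencies sum to one up to rounding
```

The density $Y_k$ of Eq. (5) is then obtained by multiplying each entry of the returned list by $4^k$.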

Remark: The ordering of a, c, g, t in Eq. (2) follows the natural dictionary ordering of k-strings in the one-dimensional space. A different ordering of a, c, g, t would change the nature of the correlations of the measure. But in our case, a different ordering of a, c, g, t in Eq. (2) gives the same multifractal spectrum ($D_q$ curve, which will be defined in the next section) when the absolute value of q is relatively small (see FIG. 2 in [31]). Hence the multifractal characteristic is independent of the ordering. In the comparison of different organisms using the measure representation, once the ordering of a, c, g, t in Eq. (2) is given, it is fixed for all organisms [31].



III. MULTIFRACTAL ANALYSIS 

The most common algorithms of multifractal analysis are the so-called fixed-size box-counting algorithms [36]. In the one-dimensional case, for a given measure $\mu$ with support $E \subset \mathbb{R}$, we consider the partition sum

$$Z_\epsilon(q) = \sum_{\mu(B) \neq 0} [\mu(B)]^q, \qquad (6)$$

$q \in \mathbb{R}$, where the sum runs over all different nonempty boxes $B$ of a given side $\epsilon$ in a grid covering of the support $E$, that is,

$$B = [k\epsilon, (k+1)\epsilon[. \qquad (7)$$

The exponent $\tau(q)$ is defined by

$$\tau(q) = \lim_{\epsilon \to 0} \frac{\log Z_\epsilon(q)}{\log \epsilon} \qquad (8)$$

and the generalized fractal dimensions of the measure are defined as

$$D_q = \tau(q)/(q - 1), \quad \text{for } q \neq 1, \qquad (9)$$

and

$$D_q = \lim_{\epsilon \to 0} \frac{Z_{1,\epsilon}}{\log \epsilon}, \quad \text{for } q = 1, \qquad (10)$$

where $Z_{1,\epsilon} = \sum_{\mu(B) \neq 0} \mu(B) \log \mu(B)$. The generalized fractal dimensions are estimated through a linear regression of

$$\frac{1}{q - 1} \log Z_\epsilon(q)$$

against $\log \epsilon$ for $q \neq 1$, and similarly through a linear regression of $Z_{1,\epsilon}$ against $\log \epsilon$ for $q = 1$. $D_1$ is called the information dimension and $D_2$ the correlation dimension. The $D_q$ for positive values of q give relevance to the regions where the measure is large, i.e., to the k-strings with high probability. The $D_q$ for negative values of q deal with the structure and the properties of the most rarefied regions of the measure.
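
The regression estimate of $D_q$ can be sketched as below. This is an illustrative reading of the box-counting procedure, under stated assumptions: the measure is supplied as a list of $4^k$ cell masses on [0, 1), the box sides are taken as $\epsilon = 4^{-m}$ for $m = 1, \ldots, k$, and the function names are hypothetical. It needs $k \geq 2$ so that the regression has at least two points.

```python
import math


def box_masses(freq, m, k):
    """Aggregate the 4**k cell masses into 4**m boxes of side 4**-m."""
    width = 4 ** (k - m)
    return [sum(freq[i:i + width]) for i in range(0, 4 ** k, width)]


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def generalized_dimension(freq, k, q):
    """Estimate D_q by linear regression over the box sides 4**-1, ..., 4**-k."""
    log_eps, values = [], []
    for m in range(1, k + 1):
        # Only nonempty boxes enter the partition sum.
        masses = [p for p in box_masses(freq, m, k) if p > 0]
        log_eps.append(-m * math.log(4))
        if q == 1:
            values.append(sum(p * math.log(p) for p in masses))
        else:
            values.append(math.log(sum(p ** q for p in masses)) / (q - 1))
    return slope(log_eps, values)


if __name__ == "__main__":
    uniform = [1 / 16] * 16
    print(generalized_dimension(uniform, 2, 2))
```

For a uniform measure every $D_q$ equals 1, which gives a quick sanity check of the estimator; a multifractal measure yields a $D_q$ curve that decreases with q.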



IV. IFS AND RIFS MODELS AND THE 
MOMENT METHOD FOR PARAMETER 
ESTIMATION 

In this paper, we propose to model the measure $\mu_k$ defined in Section II for a complete genome by a recurrent IFS. As we work with measures on compact intervals, we restrict attention to the one-dimensional case (i.e. d = 1). Consider a system of contractive maps $S = \{S_1, S_2, \ldots, S_N\}$. Let $E_0$ be a compact interval of $\mathbb{R}$, $E_{\sigma_1 \sigma_2 \cdots \sigma_n} = S_{\sigma_1} \circ S_{\sigma_2} \circ \cdots \circ S_{\sigma_n}(E_0)$ and

$$E_n = \bigcup_{\sigma_1, \ldots, \sigma_n \in \{1, 2, \ldots, N\}} E_{\sigma_1 \sigma_2 \cdots \sigma_n}.$$

Then $E = \bigcap_{n=1}^{\infty} E_n$ is the attractor of the IFS. Given a set of probabilities $p_i > 0$, $\sum_{i=1}^{N} p_i = 1$, pick an $x_0 \in E$ and define iteratively the sequence

$$x_{n+1} = S_{\sigma_n}(x_n), \quad n = 0, 1, 2, \ldots, \qquad (11)$$

where the indices $\sigma_n$ are chosen randomly and independently from the set $\{1, 2, \ldots, N\}$ with probabilities $P(\sigma_n = i) = p_i$. Then every orbit $\{x_n\}$ is dense in the attractor $E$ [37, 38]. For n large enough, we can view the orbit $\{x_0, x_1, \ldots, x_n\}$ as an approximation of $E$. This iterative process is called a chaos game.
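
The chaos game (11) can be sketched in a few lines. The snippet is only an illustration: the function name `ifs_chaos_game`, the encoding of each map as a pair (c, d) for an affine map x -> c x + d, the sample probabilities and the fixed seed are assumptions made for the example, not values from this paper.

```python
import random


def ifs_chaos_game(maps, probs, n_points, x0=0.0, seed=0):
    """Generate an orbit x_0, x_1, ..., x_n of the chaos game (11).

    maps  : list of (c, d) pairs, each standing for the map x -> c * x + d
    probs : probabilities p_i, one per map, chosen independently at each step
    """
    rng = random.Random(seed)
    x = x0
    orbit = [x]
    for _ in range(n_points):
        i = rng.choices(range(len(maps)), weights=probs)[0]
        c, d = maps[i]
        x = c * x + d
        orbit.append(x)
    return orbit


if __name__ == "__main__":
    # Four similarities of ratio 1/4 whose images tile [0, 1).
    maps = [(0.25, 0.0), (0.25, 0.25), (0.25, 0.5), (0.25, 0.75)]
    orbit = ifs_chaos_game(maps, [0.4, 0.1, 0.1, 0.4], 10000)
    print(min(orbit), max(orbit))
```

With unequal probabilities the orbit visits some subintervals far more often than others, which is what produces a non-uniform invariant measure on the attractor.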

Given a system of contractive maps $S = \{S_1, S_2, \ldots, S_N\}$ on a compact metric space $E^*$, we associate with these maps a matrix of probabilities $P = (p_{ij})$ such that $\sum_j p_{ij} = 1$, $i = 1, 2, \ldots, N$. Consider a random sequence generated by a chaos game:

$$x_{n+1} = S_{\sigma_n}(x_n), \quad n = 0, 1, 2, \ldots, \qquad (12)$$

where $x_0$ is any starting point and each index is chosen with a probability that depends on the previous index:

$$P(\sigma_{n+1} = i) = p_{\sigma_n, i}. \qquad (13)$$

The choice of the indices $\sigma_n$ as prescribed by (13) presents a fundamental difference between this iterative process and the usual chaos game defined by (11). The triple $(E^*, S, P)$ is called a recurrent IFS. The flexibility of RIFS permits the construction of more general sets and measures which do not have to exhibit the strict self-similarity of IFS. This offers a more suitable framework to model fractal-like objects and measures in nature.

Let $\mu$ be the invariant measure on the attractor $E$ of an IFS or RIFS, and $\chi_B$ the characteristic function of the Borel subset $B \subset E$; then from the ergodic theorem for IFS or RIFS,

$$\mu(B) = \lim_{n \to \infty} \frac{1}{n + 1} \sum_{k=0}^{n} \chi_B(x_k). \qquad (14)$$

In other words, $\mu(B)$ is the relative visitation frequency of $B$ during the chaos game. A histogram approximation of the invariant measure may then be obtained by counting the number of visits made to each pixel on the computer screen.
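
The visitation-frequency estimate (14) for a recurrent chaos game driven by (12) and (13) might look as follows. This is a sketch under assumptions: it uses the four maps of ratio 1/4 adopted later in this section, takes the cells of the $4^k$ grid on [0, 1) as the "pixels", starts from index 0 and the point 0, and uses the hypothetical name `rifs_histogram`.

```python
import random


def rifs_histogram(P, k, n_points, seed=0):
    """Relative visit frequencies of the 4**k grid cells of [0, 1).

    P is a 4x4 row-stochastic matrix; row `state` gives the distribution of
    the next index given the current one, as in Eq. (13).
    """
    rng = random.Random(seed)
    n_cells = 4 ** k
    counts = [0] * n_cells
    x, state = 0.0, 0  # assumed starting point and starting index
    for _ in range(n_points):
        state = rng.choices(range(4), weights=P[state])[0]
        x = x / 4 + state / 4  # map number `state`: x -> x/4 + state/4
        counts[min(int(x * n_cells), n_cells - 1)] += 1
    return [c / n_points for c in counts]


if __name__ == "__main__":
    # Example matrix that forbids the transition from index 0 to index 0.
    P = [[0.0, 1 / 3, 1 / 3, 1 / 3]] + [[0.25] * 4 for _ in range(3)]
    hist = rifs_histogram(P, 2, 5000)
    print(hist[0], sum(hist))
```

A zero entry in P forbids a transition, so the corresponding cells are never visited; an ordinary IFS with positive probabilities cannot produce such gaps, which is one way to see the extra flexibility of the recurrent model.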

The coefficients in the contractive maps and the probabilities in the IFS or RIFS model are the parameters to be estimated for a given measure which we want to simulate. Vrscay [38] introduced a moment method to perform this task. If $\mu$ is the invariant measure and $E$ the attractor of the IFS or RIFS in $\mathbb{R}$, the moments of $\mu$ are

$$g_n = \int_E x^n \, d\mu, \qquad g_0 = \int_E d\mu = 1. \qquad (15)$$

If $S_i(x) = c_i x + d_i$, $i = 1, \ldots, N$, then the following well-known recursion relations hold for the IFS model:

$$\left[1 - \sum_{i=1}^{N} p_i c_i^n\right] g_n = \sum_{j=1}^{n} \binom{n}{j} g_{n-j} \left(\sum_{i=1}^{N} p_i c_i^{n-j} d_i^j\right). \qquad (16)$$

Thus, setting $g_0 = 1$, the moments $g_n$, $n \geq 1$, may be computed recursively from a knowledge of $g_0, \ldots, g_{n-1}$.
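For the RIFS model, we have

$$g_n = \sum_{j=1}^{N} g_n^{(j)}, \qquad (17)$$

where $g_n^{(j)}$, $j = 1, \ldots, N$, are given by the solution of the following system of linear equations:

$$\sum_{j=1}^{N} \left(p_{ji} c_i^n - \delta_{ij}\right) g_n^{(j)} = -\sum_{k=0}^{n-1} \binom{n}{k} \left[\sum_{j=1}^{N} c_i^k d_i^{n-k} p_{ji} g_k^{(j)}\right], \quad i = 1, \ldots, N, \ n \geq 1. \qquad (18)$$

For $n = 0$, we set $g_0^{(i)} = m_i$, where $m_i$ are given by the solution of the linear equations

$$\sum_{j=1}^{N} p_{ji} m_j = m_i, \quad i = 1, 2, \ldots, N, \qquad g_0 = \sum_{i=1}^{N} m_i = 1. \qquad (19)$$
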

If we denote by $G_k$ the moments obtained directly from a given measure using (15), and by $g_k$ the formal expression of the moments obtained from (16) for the IFS model or from (17) and (18) for the RIFS model, then through solving the optimisation problem

$$\min_{c_i, d_i, p_i \text{ or } p_{ij}} \sum_{k=1}^{n} (g_k - G_k)^2 \quad \text{for some chosen } n, \qquad (20)$$

we can obtain the estimates of the parameters in the IFS or RIFS model.

From the measure representation of a complete genome, it is natural to choose $N = 4$ and

$$S_1(x) = x/4, \quad S_2(x) = x/4 + 1/4, \quad S_3(x) = x/4 + 1/2, \quad S_4(x) = x/4 + 3/4$$

in the IFS or RIFS model. Based on the estimated values 
of the probabilities, we can use the chaos game to gener- 
ate a histogram approximation of the invariant measure 
of the IFS or RIFS, which then can be compared with 
the given measure of the complete genome. 

V. APPLICATION TO THE RECOGNITION
PROBLEM 

The measure representations for a large number of complete genomes, as described in Section II, were obtained in Yu et al. It was found that substrings with k = 6 seem to provide a limiting measure that can be used for the classification and recognition of DNA sequences. Hence we will use 6-strings in this paper. We then estimated their IFS and RIFS models using the moment method described in Section IV. The chaos game algorithm was next performed to generate an orbit as in (11), or as in (12) with (13). From these orbits, simulated approximations of the invariant measures of IFS or RIFS were obtained via the ergodic theorem (14). In order to clarify how close the simulated measure is to the original measure, we convert a measure to its walk representation: we denote by $\{t_j, j = 1, 2, \ldots, 4^k\}$ the density of a measure and by $t_{ave}$ its average, then define the walk $T_j = \sum_{i=1}^{j} (t_i - t_{ave})$, $j = 1, 2, \ldots, 4^k$. The two walks of the given measure and the measure generated by the chaos game of an IFS or RIFS are then plotted in the same figure for comparison. We found that RIFS is a better model than IFS to simulate complete genomes. We determine the "goodness" of the measure simulated from the RIFS model relative to the original measure based on the following relative standard error (RSE)



$$RSE = \frac{RMSE}{SE},$$

where

$$RMSE = \sqrt{\frac{1}{4^6} \sum_{j=1}^{4^6} (t_j - \hat{t}_j)^2}$$

and

$$SE = \sqrt{\frac{1}{4^6} \sum_{j=1}^{4^6} (t_j - t_{ave})^2},$$

$(t_j)_{j=1}^{4^6}$ and $(\hat{t}_j)_{j=1}^{4^6}$ being the densities of the original measure and the RIFS simulated measure respectively. The goodness of fit is indicated by the result RSE < 1. For example, the RIFS simulation of the 6-strings measure representation of M. genitalium is shown in the left figure of FIG. 2, and the walk of its original 6-strings measure representation and that simulated from the corresponding RIFS are shown in the right figure of FIG. 2. For the whole genome, RMSE = 0.00020675, SE = 0.0003207 and RSE = 0.6447 < 1. It is seen that the RIFS simulation fits the original measure very well.
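
The walk representation and the RSE criterion can be computed as in the following sketch. It is an illustration of the definitions only: the names `walk` and `relative_standard_error` are hypothetical, and both densities are assumed to be lists of equal length.

```python
import math


def walk(density):
    """Walk T_j: cumulative sum of the deviations of the density from its mean."""
    t_ave = sum(density) / len(density)
    total, out = 0.0, []
    for t in density:
        total += t - t_ave
        out.append(total)
    return out


def relative_standard_error(original, simulated):
    """RSE = RMSE / SE for an original density and a simulated one."""
    n = len(original)
    t_ave = sum(original) / n
    rmse = math.sqrt(sum((t - s) ** 2 for t, s in zip(original, simulated)) / n)
    se = math.sqrt(sum((t - t_ave) ** 2 for t in original) / n)
    return rmse / se


if __name__ == "__main__":
    original = [0.1, 0.2, 0.3, 0.4]
    print(relative_standard_error(original, [0.25] * 4))
```

Replacing the simulated density by the constant average of the original gives RMSE = SE, hence RSE = 1; a value below 1 therefore means that the simulation tracks the original more closely than a flat density would.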

We next pick out five organisms (without any partic- 
ular a priori reason) from about 50 organisms whose 
complete genomes are currently available. These are A. 
fulgidus, B. burgdorferi, C. trachomatis, E. coli and M. 
genitalium. Fragments of different length rates ranging 



from 1/20 to 1/4 and with random starting points along 
the sequences were then selected. Here the length rate 
of a fragment means the length of this fragment divided 
by the length of the genome of the same organism. For 
example, the measure representations of different fragments of M. genitalium are shown in FIG. 3. The RIFS model for each of these fragments was next estimated. We also show the RIFS simulation of the 6-strings measure representation of the 1/20 fragment of M. genitalium in the left figure of FIG. 4. The walk of its original 6-strings measure representation and that of the RIFS simulation are shown in the right figure of FIG. 4. For this fragment, RMSE = 0.00023169, SE = 0.00035475 and RSE = 0.6531 < 1. Again, the RIFS simulation fits the original measure of this fragment very well.

It should be noted that column i in the matrix $P$ describes the activity of similarity $S_i$ in each RIFS. To be able to represent each fragment on a three-dimensional plot, we define

$$P_1 = (p_{11} + p_{21} + p_{31} + p_{41})/4,$$
$$P_2 = (p_{12} + p_{22} + p_{32} + p_{42})/4, \qquad (21)$$
$$P_3 = (p_{13} + p_{23} + p_{33} + p_{43})/4.$$

Each fragment is then represented by the vector $(P_1, P_2, P_3)$. The values of these vectors are provided in Table I, and the vectors are plotted in FIG. 5. It is seen that the vectors of the fragments from the same organism cluster together, and this clustering holds for all selected lengths. This accuracy is uniform for all five organisms randomly selected.
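
A possible way to use the vectors of Eq. (21) for matching is a nearest-neighbour rule in the Euclidean distance, sketched below. The reference vectors are the whole-genome rows copied from Table I; the nearest-neighbour rule itself, and the names `fragment_vector`, `WHOLE_GENOME` and `closest_organism`, are assumptions made for this illustration rather than a procedure spelled out in the text.

```python
import math


def fragment_vector(P):
    """(P1, P2, P3): averages of the first three columns of the 4x4 matrix P."""
    return tuple(sum(P[i][j] for i in range(4)) / 4 for j in range(3))


# Whole-genome vectors (P1, P2, P3) copied from Table I.
WHOLE_GENOME = {
    "A. fulgidus": (0.257277, 0.248579, 0.233379),
    "B. burgdorferi": (0.335605, 0.173103, 0.143191),
    "C. trachomatis": (0.284452, 0.223418, 0.201998),
    "E. coli": (0.248986, 0.255393, 0.242893),
    "M. genitalium": (0.335212, 0.175269, 0.147534),
}


def closest_organism(vector, references=WHOLE_GENOME):
    """Name of the reference vector nearest to `vector` (Euclidean distance)."""
    return min(references, key=lambda name: math.dist(vector, references[name]))


if __name__ == "__main__":
    # 1/20 fragment of M. genitalium, as listed in Table I.
    print(closest_organism((0.336145, 0.182237, 0.149540)))
```

Because some whole-genome vectors lie close to each other in this three-dimensional summary, a rule of this kind is best treated as a preliminary screen; using more entries of P gives additional separation at a higher cost.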

In matching a fragment to an organism, the $D_q$ curve, which depicts the generalised dimension of the invariant measure as described in Section III, can also be used. We computed these curves for the above five organisms at a variety of fragment lengths, down to 1/100th of the original sequence. The results are reported for M. genitalium in FIG. 6. It is seen that this method also performs very well. However, it suffers the drawback that many different organisms seem to have the same or closely related $D_q$ curves. In this sense, the method based on the RIFS has higher resolution in distinguishing the genomes. If necessary, the entire matrix $P$ may be used, instead of (21), in this comparison. This would enhance the matching, but will not be as economical as (21). Yu et al. [39] used the entire matrix $P$ to define the distance between two organisms in a higher dimensional space and then constructed the evolutionary tree of more than 50 organisms. The RIFS model can also be used to simulate the measure representation of proteins based on the HP model [40].



VI. CONCLUSION 

This paper provides a method for matching a fragment to an organism, taking advantage of the multifractal characteristic of the measure representation of their genomes. It was demonstrated empirically that the underlying mechanism of this multifractality can be captured by a recurrent IFS, whose theory is well founded in the fractal geometry literature. Fast algorithms for the computation of these RIFS and related quantities, as well as tools for comparison, are available. The method seems to work reasonably well with low computing cost. This fast and economical method can be performed at a preliminary stage to cluster fragments before a more extensive method, such as the random shotgun sequencing method mentioned in the Introduction, is brought in for higher accuracy.

TABLE I. Values of vector representation (P1, P2, P3) of fragments from the five organisms.

Organism         Sequence        P1         P2         P3
A. fulgidus      1/4 fragment    0.255114   0.248454   0.234208
                 1/8 fragment    0.257610   0.248891   0.232988
                 1/15 fragment   0.260611   0.245235   0.229882
                 1/20 fragment   0.253536   0.247569   0.233501
                 whole genome    0.257277   0.248579   0.233379
B. burgdorferi   1/4 fragment    0.305165   0.160478   0.165485
                 1/8 fragment    0.303635   0.160063   0.166952
                 1/15 fragment   0.351298   0.188586   0.135497
                 1/20 fragment   0.310800   0.163463   0.162279
                 whole genome    0.335605   0.173103   0.143191
C. trachomatis   1/4 fragment    0.293139   0.226877   0.197907
                 1/8 fragment    0.275901   0.220717   0.206184
                 1/15 fragment   0.299231   0.226269   0.194245
                 1/20 fragment   0.293706   0.219299   0.192447
                 whole genome    0.284452   0.223418   0.201998
E. coli          1/4 fragment    0.253291   0.253147   0.237551
                 1/8 fragment    0.250753   0.250494   0.240300
                 1/15 fragment   0.256441   0.248731   0.232963
                 1/20 fragment   0.252115   0.252027   0.237276
                 whole genome    0.248986   0.255393   0.242893
M. genitalium    1/4 fragment    0.339263   0.165702   0.140649
                 1/8 fragment    0.335415   0.187653   0.158851
                 1/15 fragment   0.337408   0.173610   0.144801
                 1/20 fragment   0.336145   0.182237   0.149540
                 whole genome    0.335212   0.175269   0.147534

FIG. 1. Histograms of substrings with different lengths.

[Figure panel title: Simulated measure using RIFS model]

FIG. 4. Left: Simulation of the measure representation (6-strings) of the 1/20 fragment of M. genitalium using the recurrent IFS model. Right: Walk comparison for the measure representation (6-strings) of the 1/20 fragment of M. genitalium and its RIFS simulation.

FIG. 5. Vector representation (P1, P2, P3) of all fragments from five organisms.

FIG. 6. The dimension spectra of fragments from M. genitalium.